PSCI 2270 - Week 3
Department of Political Science, Vanderbilt University
September 15, 2023
Learning about Population from Sample
Descriptive Statistics
Necessary Math
Types of Data Collection
We often cannot survey or measure outcomes for the whole set of units we are interested in \(\Rightarrow\) Target population
We then have to resort to a subset of units that we can reasonably collect data for \(\Rightarrow\) Sample
We collect the sample from the available list that ideally includes the whole population \(\Rightarrow\) Sampling frame
Simple random sampling: Every unit has an equal selection probability
e.g. random digit dialing (RDD):
Non-probability sampling: e.g. Opt-in Internet panels
Literary Digest predicted elections using mail-in polls
Source of addresses: automobile registrations, phone books, etc.
In 1936, sent out 10 million ballots, over 2.3 million returned
| Pollster | FDR’s Vote Share |
|---|---|
| Literary Digest | 43% |
| George Gallup | 56% |
| Actual Outcome | 62% |
Ballots skewed toward the wealthy (with cars, phones) \(\Rightarrow\) selection bias
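A minimal simulation sketch of this kind of selection bias. Every number here is made up for illustration (the 30% wealthy share and the 40% / 70% support rates are assumptions, not 1936 data). It compares a small simple random sample with a much larger sample drawn from a frame that lists only wealthy units.

```python
import random

random.seed(1936)

# Hypothetical population: 30% wealthy, 70% not wealthy (assumed shares).
# Assumed support for the incumbent: 40% among the wealthy, 70% among the rest.
N = 100_000
population = []
for _ in range(N):
    wealthy = random.random() < 0.30
    supports = random.random() < (0.40 if wealthy else 0.70)
    population.append((wealthy, supports))


def share(units):
    """Fraction of units that support the incumbent."""
    return sum(supports for _, supports in units) / len(units)


true_share = share(population)

# Simple random sample: every unit has the same selection probability.
srs = random.sample(population, 2_000)

# Skewed sampling frame: only wealthy units (car or phone owners) are listed.
frame = [unit for unit in population if unit[0]]
biased = random.sample(frame, 20_000)

print(f"population:              {true_share:.3f}")
print(f"SRS (n=2,000):           {share(srs):.3f}")
print(f"skewed frame (n=20,000): {share(biased):.3f}")
```

The sample from the skewed frame is ten times larger but lands far from the population share, while the small random sample lands close to it. A bigger sample does not repair a bad frame.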
| Pollster | Truman | Dewey | Thurmond | Wallace |
|---|---|---|---|---|
| Crossley | 45% | 50% | 2% | 3% |
| Gallup | 44% | 50% | 2% | 4% |
| Roper | 38% | 53% | 5% | 4% |
| Actual Outcome | 50% | 45% | 3% | 2% |
Quota sampling:
Potential unobserved confounding \(\Rightarrow\) selection bias
Republicans easier to interview within quotas (phones, listed addresses, etc.)
Descriptive (summary) statistics are numerical summaries of those measurements
Two salient features of a variable that we want to know:
\[ \color{#98971a}{\bar{x}} = \color{#d65d0e}{\frac{1}{n}} \color{#458588}{\sum_{i = 1}^{n} x_{i}} \]
What’s all this notation?
Applied to the mean:
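A small sketch of the mean formula on five hypothetical measurements, with each piece of the notation computed separately (the values in `x` are invented for illustration):

```python
x = [2, 4, 4, 5, 10]   # hypothetical measurements x_1, ..., x_n

n = len(x)             # n: number of measurements
total = sum(x)         # the sum over i = 1, ..., n of x_i
x_bar = total / n      # multiply the sum by 1/n

print(x_bar)           # 5.0
```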
Median more robust to outliers:
Quantile (quartile, quintile, percentile, etc.):
Interquartile range (IQR): a measure of variability
One definition of outliers: more than 1.5 × IQR above the upper quartile or below the lower quartile
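The median, quartiles, IQR, and the 1.5 × IQR outlier rule can be sketched with Python's standard `statistics` module. The income values are hypothetical, and `method="inclusive"` is only one of several common quantile conventions; other conventions give slightly different quartiles.

```python
import statistics

# Hypothetical incomes in $1,000s; the last value is deliberately extreme.
incomes = [28, 31, 35, 38, 40, 42, 45, 47, 52, 400]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Outlier rule: more than 1.5 x IQR beyond the lower or upper quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in incomes if v < lower_fence or v > upper_fence]

print("mean:", mean, "median:", median)
print("IQR:", iqr, "outliers:", outliers)
```

The single extreme value pulls the mean well above the median, which barely moves. That is the sense in which the median is more robust to outliers.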
\[ \text{sd} = \color{#cc241d}{\sqrt{\color{#b16286}{\frac{1}{n - 1}} \color{#98971a}{\sum_{i = 1}^{n}} \color{#458588}{(}\color{#d65d0e}{x_i - \bar{x}}\color{#458588}{)^2} }} \]
Steps:
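One way to read the formula is from the inside out. The sketch below uses hypothetical data and checks the result against `statistics.stdev`, which also divides by \(n - 1\).

```python
import math
import statistics

x = [2, 4, 4, 5, 10]                      # hypothetical measurements
n = len(x)

x_bar = sum(x) / n                        # the mean: 5.0
deviations = [xi - x_bar for xi in x]     # x_i - x_bar: -3, -1, -1, 0, 5
squared = [d ** 2 for d in deviations]    # squared deviations: 9, 1, 1, 0, 25
sum_sq = sum(squared)                     # sum of squared deviations: 36
variance = sum_sq / (n - 1)               # divide by n - 1: 9
sd = math.sqrt(variance)                  # square root: 3

print(sd)                                 # 3.0
print(statistics.stdev(x))                # same value from the standard library
```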
Probability:
Law of Large Numbers
Central Limit Theorem:
In real data, we will have a set of \(n\) measurements on a variable: \(X_1\), \(X_2\), …, \(X_n\)
Empirical analyses: sums or means of these n measurements
Law of Large Numbers (LLN)
Let \(X_1\), …, \(X_n\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then the sample mean \(\bar{X}_{n} = \frac{1}{n} \sum_{i = 1}^{n} X_i\) converges to \(\mu\) as \(n\) gets large.
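A quick simulation sketch of the LLN, using rolls of a fair six-sided die as the i.i.d. random variables (so \(\mu = 3.5\)). The die and the sample sizes are assumptions chosen only for illustration.

```python
import random

random.seed(2270)


def sample_mean(n):
    """Mean of n i.i.d. rolls of a fair six-sided die (mu = 3.5)."""
    return sum(random.randint(1, 6) for _ in range(n)) / n


# The sample mean settles near 3.5 as n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: sample mean = {sample_mean(n):.3f}")
```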
The normal distribution is the classic “bell-shaped” curve.
Three key properties:
Central Limit Theorem (CLT)
Let \(X_1\), …, \(X_n\) be i.i.d. random variables with mean \(\mu\) and variance \(\sigma^2\). Then, \(\bar{X}_n\) will be approximately distributed \(N(\mu, \sigma^2 / n)\) in large samples.
Approximation is better as \(n\) goes up \(\Rightarrow\) asymptotics
“Sample means tend to be normally distributed as samples get large.”
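A simulation sketch of the CLT under assumed settings: each sample holds \(n = 100\) draws from an exponential distribution with rate 1 (strongly skewed, with \(\mu = 1\) and \(\sigma^2 = 1\)), and the sampling is repeated 5,000 times. If the normal approximation holds, the sample means should have mean near \(\mu\), variance near \(\sigma^2 / n\), and roughly 95% of them should fall within 1.96 standard errors of \(\mu\).

```python
import random
import statistics

random.seed(2270)

MU = 1.0        # mean of an Exponential(rate=1) draw
SIGMA2 = 1.0    # variance of an Exponential(rate=1) draw
n = 100         # size of each sample
reps = 5_000    # number of repeated samples

# One sample mean per repeated sample.
means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(means)
var_of_means = statistics.variance(means)

# Share of sample means within 1.96 standard errors of mu.
se = (SIGMA2 / n) ** 0.5
within = sum(abs(m - MU) <= 1.96 * se for m in means) / reps

print(f"mean of sample means:     {mean_of_means:.3f} (mu = {MU})")
print(f"variance of sample means: {var_of_means:.4f} (sigma^2 / n = {SIGMA2 / n})")
print(f"share within 1.96 SE:     {within:.3f}")
```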
We usually only have 1 sample, so we’ll only get 1 sample mean. So why do we care about LLN/CLT?
\[ SE = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]
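A short sketch of the standard error formula. The function name `standard_error` and the sample values are hypothetical. Because \(\sigma\) is rarely known in practice, the second part plugs in the sample standard deviation as a stand-in, which is an assumption beyond the formula itself.

```python
import math
import statistics


def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# Quadrupling the sample size halves the standard error.
print(standard_error(10, 100))   # 1.0
print(standard_error(10, 400))   # 0.5

# With one sample in hand, use the sample sd in place of sigma (an assumption).
sample = [52, 47, 61, 58, 49, 55, 63, 50, 57, 48]   # hypothetical measurements
se_hat = standard_error(statistics.stdev(sample), len(sample))
print(round(se_hat, 2))
```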
Interview data: Data that are collected from responses to questions posed by the researcher to a respondent
Firsthand observation: Data that may be collected by making observations in a field study or in a laboratory setting
Document analysis: Use of any audio, visual, or written materials as a source of data
Interview data can be collected
Always keep records of every answer you receive!
Sample size:
“Document” is understood very broadly
Running record: Materials that are collected systematically across time
Episodic records: Records produced in a casual, personal, and accidental manner
Direct observation: Observing the political behavior itself
Indirect observation: Observing physical traces of the political behavior
Validity of the measure
Combine different collection strategies for reliability
Key examples:
Also to consider:
Think about possible data strategies for answering the question: Which factors affect election participation?
Applying CLT/LLN to get point estimates and estimates of uncertainty
Comparing group means and logic of causal inference